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Abstract 

We perform forward error analysis for a large class of recursive matrix multiplication algorithms in 
the spirit of [D. Bini and G. Lotti, Stability of fast algorithms for matrix multiplication, Numer. Math. 
36 (1980), 63-72]. As a consequence of our analysis, we show that the exponent of matrix multiplication 
(the optimal running time) can be achieved by numerically stable algorithms. We also show that new 
group-theoretic algorithms proposed in [H. Cohn, and C. Umans, A group-theoretic approach to fast 
matrix multiplication, FOGS 2003, 438-449] and [H. Cohn, R. Kleinberg, B. Szegedy and C. Umans, 
Group-theoretic algorithms for matrix multiplication, FOGS 2005, 379-388] are all included in the class 
of algorithms to which our analysis applies, and are therefore numerically stable. We perform detailed 
error analysis for three specific fast group-theoretic algorithms. 

1 Introduction 

Matrix multiplication is one of the most fundamental operations in numerical linear algebra. Its importance 
is magnified by the number of other problems (e.g., computing determinants, solving systems of equations, 
matrix inversion, LU decomposition, etc.) that are reducible to it (see [51 Chapter 16]). 

Starting from Strassen's result Jj] that two square nxn matrices can be multiplied in 0{n'^^^^) operations, 
a sequence of improvements was made to achieve ever better bounds on the exponent of matrix multiplication, 
which is the smallest real number lu for which nxn matrix multiplication can be performed in 0{n'^^^) 
operations for any 77 > 0. The complexity of the fastest known method to date due to D. Coppersmith 
and S. Winograd [TD] is about 0(n^-^*). A new approach based on group-theoretic methods was recently 
developed in [S] and [7], along with several ideas that can potentially reduce the bound on w to w = 2 
(obviously, w cannot fall below 2, since 0{n^) operations are required just to read off the entries of the 
resulting matrix). 

Along with computational cost, numerical stability is an equally important factor for the implementation 
of any algorithm, since accumulation and propagation of roundoff errors may otherwise render the algorithm 
useless. It is the purpose of this work to analyze recursive fast matrix multiplication algorithms generalizing 
Strassen's algorithm, as well as the new class of algorithms described in 9 and [J, from the stability point 
of view. The rounding error analysis of Strassen's method was initiated by Brent ([ll[I3j, [HI chap. 23]). In 
our analysis, we rely on earlier work by Bini and Lotti [2]. These results do not apply directly to our setup, 
because they do not account for errors from multiplicative constants, for nonstationarity in subdividing 
matrices, and for additional pre- and post-processing operations (which appear in the methods considered 
here but not in Strassen's method). However, we are able to refine the approach of Bini and Lotti to build a 
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sufficiently inclusive framework, within which the new algorithms proposed in [S] and [7] can be analyzed in 
detail. Combining this framework with a result of Raz |17j also allows us to prove that there exist numerically 
stable matrix multiplication algorithms which perform 0{n'^~^^) operations, for arbitrarily small 77 > 0. 

The definition of stability used in this paper measures errors normwise - see inequality This is 
weaker than the componentwise bound satisfied by conventional matrix multiplication [141 eqn. 3.13]. In 
fact. Miller [16j showed that any algorithm satisfying the componentwise bound must do at least arithmetic 
operations. The impact of a normwise error bound on other algorithms depending on matrix multiplication 
has been investigated in [T^l E] . In [TT] we take up the question whether other linear-algebraic algorithms 
exist that are "stable" in some sense and just as fast as the algorithms considered in this paper. 

This paper is organized as follows: In Section [5] we discuss the model of arithmetic and algorithms that 
is used in the rest of the paper, along with some basics on forward error bounds. In Section [3] wc introduce 
and analyze from the stability point of view a wide class of recursive algorithms for matrix multiplication. 
We begin by discussing Strassen-like algorithms, based on recursive partitioning of matrices into the same 
number of blocks. We prove that all such algorithms are stable, and that this class of algorithms contains 
algorithms with running time 0{n'^^^) for arbitrarily small positive 77. We then generalize our analysis to 
algorithms where the number of blocks depends on the level of recursion, and finally to algorithms involving 
additional preprocessing before and postprocessing after partitioning into blocks. In Section 2] we use this 
approach to analyze the group-theoretic algorithms from [9] and [7]- In particular, in Section [4. 41 we perform 
detailed stability analysis for three specific classes of algorithms introduced in ^ and [7] . The paper ends 
with a brief discussion of other fast linear algebra algorithms in Section [5l 

2 Model of arithmetic and algorithms 

We adopt the classical model of rounded arithmetic, where each arithmetic operation introduces a small 
multiplicative error, i.e., the computed value of each arithmetic operation op(a, b) is given by op{a, b){l + 9) 
where \9\ is bounded by some fixed machine precision s but is otherwise arbitrary. The arithmetic operations 
in classical arithmetic are — , •}. All of the analysis in this paper applies to matrices with either real of 
complex entries, i.e. we interpret the operands in these arithmetic operations as being either real or complex 
numbers. We assume that the roundoff errors are introduced by every execution of any arithmetic operation 
(in contrast to [5], where it is assumed that multiplication by entries of the auxiliary coefficient matrices 
U, V and W is error- free). We further assume that all algorithms output the exact value in the absence of 
roundoff errors (i.e., when all errors 6 are zero). 

For simplicity, let us denote by the set of all errors 9 bounded by e and by A the set of all sums 
{1 + 6' : 9 G 0}. We use the standard notation 

A + B :={a + 6 : a e A, 6 e B}, A - B :={a - 6 : a G A, 6 e B}, A ■ B :={a ■ b : a £ A, beB} 

for the algebraic sum/difference/product of two sets A and B. We will also use the notation A^ for the set 
A ■ A ■ ■ ■ A.. Note that the error sets are ordered by inclusion: A^ C A-'^^ for all j S Z+. 



We now state the most basic error bound that will be used repeatedly throughout this paper. Suppose a 
branch-free algorithm performs a number of arithmetic operations to compute a polynomial / in the inputs 
X = (xi, . . . , Xn). Then the resulting computed value fcomp is a function of both x and the errors 9. Moreover, 



for some polynomials fj in x. Suppose v is the maximum of the exponents j occurring in the terms A-' and 
m is the number of summands in the expression for fcomp{x). If the algorithm outputs the correct value 
f{x) in the absence of roundoff errors, then ([T]) implies 



j terms 




(1) 



fcompix) - f{x)\ < mmax |/j(x)|((l -I- e)" - 1) m max | (a;) | i^e -I- 0(£^). 



(2) 
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In particular, suppose that addition of n quantities a;i,...,x„ is performed by running the classical 
parallel [log2 rt]-step algorithm as a straight-line algorithm, i.e., computing the total sum by adding the two 
sums ^3 ^^'^ Sj=[n/2]+i ^j' ^^"^ recursively computing each of the two sums in the same manner. 

Then the resulting computed value ^compix) lies in the set 

n 

hence 

\T,comp{x) -y^Xjl < max|a::j|[log2n]e + 0(e^). (3) 
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3 Error analysis for recursive matrix multiplication algorithms 

In this section, we perform forward error analysis for three classes of recursive matrix multiplication al- 
gorithms, starting with the Strassen-like algorithms based on stationary partitioning, then generalizing to 
algorithms with non-stationary partitioning, and finally to the algorithms of the kind developed in [5] and [3- 
The error analysis in Sections 13 . II and [3 . 21 is done with respect to the entry- wise maximum norm on A, B, 
C — AB, while the analysis in Section [3.31 is for an arbitrary matrix norm satisfying an extra monotonicity 
assumption. All our bounds are of the form 

\\C,omp - C\\ < fi{n)e\\A\\ \\B\\ + 0{e^), (4) 

with typically low degree polynomials in the order n of the matrices involved, so that ii{n) = 0{n'^) 
for some constant c. Note that one can easily switch from one norm to another at the expense of picking up 
additional factors that will depend n, using the equivalence of norms on a finite-dimensional space. 

Later, in Section 14.41 we will give values of the exponent c for sample algorithms. Here, we argue that 
the exact value of c does not greatly impact the complexity of practical matrix multiplication, in the sense of 
the bit-complexity for computing AB to a desired accuracy || Ccomp — C\\ < eoll^ll ■ II ^11, for a given eo < 1- 

Since ||C|| can be about as large as \\A\\ ■ \\B\\, the bound ^ is interesting only when fi{n)e < 1. Any 
algorithm will have c > 1, since even just straightforwardly computing a dot product has c — 1. Thus 
< 1, so ne < 1, and 6:=log2(l/e) > log2 n, where b is the number of bits used to represent the fractional 
part of the floating point numbers. 

Now suppose we want to choose e just small enough (6 just large enough) to guarantee \\Ccomp ~ C|| < 
eoll^ll • and ask how the complexity of the resulting algorithm depends on the exponent c. Setting 

eo = n'^i, we get b = log2(l/e) = log2(l/eo) + c • log2 n > log2 n, i.e. the number of bits 6 needed grows 
proportionally to log2 n. The cost of &-bit arithmetic is in the range from 0{b^) (done straightforwardly) 
down to 0(6^+°*^^^) (using Schonhage-Strassen [H])- Therefore, the bit-complexity of computing the answer 
with error proportional to eo will be at most a polylog(n) factor larger than the bound 0{n'^) gotten from 
ignoring bit-complexity, and only slightly superlinearly (up to quadratically) dependent on c. 



3.1 Stationary partition algorithms 

We next recall some basic notions related to recursive matrix multiplication algorithms. This section is 
closely related to the paper by Bini and Lotti. However, our approach is more inclusive, as will be 
explained later in this section as we develop pertinent details. 

We consider recursive algorithms for matrix multiplication. A bilinear non-commutative algorithm (see 
[2] or [5]) that computes products of fc x matrices C = AB over a ground field F using t non-scalar 
multiplications is determined by three k'^xt matrices U, V and W with elements in a subfield H C F such 
that 

/k^ \ /k^ \ 

Chi^'^WrsPs, where Ps ^ UisX,; X!"^''^-'' ' r^k{h-l)+l, h,l ^l,...,k, (5) 
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where Xi (resp. yj) are the elements oi A ^ (^.tj) (resp. of B = (6^)) ordered column-wise, and C — (cy) is 
the product C = AB. 

For an arbitrary n, the algorithm consists in recursive partitioning and using formula ^ to compute 
products of resulting block matrices. More precisely, suppose that A and B are of size nxn, where n is a 
power of k (which can always be achieved by augmenting the matrices A and B by zero columns and rows) . 
Partition A and B into square blocks Aij, Bij of size {n/k)x{n/k). Then the blocks Cm of the product 
C = AB can be computed by applying (O to the blocks of A and -B, where each block Aij , Bij has to 
be again partitioned into k^ square sub-blocks to compute the t products Pg and then the blocks Cm ■ The 
algorithm obtained by running this recursive procedure logj, n times computes the product C = AB using 
at most 0{n^°^'' *) multiplications. 

Now we are in a position to analyze recursive matrix multiplication algorithms. We first look at the 
outermost recursion, denoting the blocks of A ordered column-wise by Xi and the blocks of B ordered 
column-wise by Yj. We will index the levels of recursion by j = I, . . . ,p:— log^ n, increasing as we go down. 
Since multiplication by an element of C/ or V introduces a multiple of 1 + 9 for some 6 G & and since ^ 
and ^ hold, the computed value Pl^lomp for each quantity pj^' is obtained by running the fast matrix 
multiplication algorithm on the obtained pairs of {n/k)x(n/k) matrices 

k"- 

i=i j=i 

where as := [logj a^] , (3s := riog2 bs] , and 

fls :=#{wjs : Uis^O, i = l,.. . ■.= #{vjs : Vjs / 0, j = 1, . . . , /c^} for s = 1, . . . , t. 

The matrices A\^}comp, Pi^lomp are further partitioned and the same procedure is applied to the obtained 
blocks, etc., log^, n times, until the resulting blocks all have size 1x1. To see how the errors propagate, note 
that if the inputs to ([5]) are given as sums of certain matrices A^, B^, each with a possible error in A", A'^, 
respectively (i.e., as elements of the sets X^a,^ ^^'^"i X^s^ B^A.^), then the resulting inputs at the next 
level are elements of the sets 

EE"-^»(^^)^"^"°^'' EE"J-^J(^^)^''^''°^'' respectively. 

i=l j=l 

Thus by going all the way down to the logj. n level we multiply each element of the original input matrix A 
by error terms in A'^^+"=^ '"^^ " and by logj. n elements of U. Likewise, each element of B is multiplied by 
error terms from the set A^^+'^'''' " and logj, n elements of V. 

Now, at the lowest level of our recursive scheme, we begin to put together quantities Pl^lomp- The lowest- 
level computation of Pl^lomp is simply a scalar multiplication, so it brings in an additional factor from A. 
Then the quantities Ch^^i^^comp are computed by ([5]). Note that, in general, 

t 

Cjil^comp G ^ ^ ^rsPsjComp^^ ; (7) 

3=1 

where 7^ :=[log2 c^] and 

Cr := i(={wrs ■ Wrs ^ 0, s ^ 1,. . . ,t} for r ^ I, . . . ,k'^. 

Also, Ugbs terms of the kind WrsUisVjsXiYj need to be added to produce Ps,comp from the products (XiYj) 
using formula ([6]). So J2l=i(^sbs^rs terms containing a product Xiyj, where Xi is an entry of A, yj is an 
entry of B, are added to produce Chpip,comp by formula ([7]), where 



1 Wrs ^ 
Wrs = 0. 
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Following [5], we denote by e the vector with components e^ :=X]l=i o,sbsS.rs, and by emax the maximmii 
emax := max^ e^. 

As the second part of the recursive procedure is run from the bottom to the top, we "assemble" all 
the blocks Chj.ij,comp from the blocks at the previous level. When the algorithm terminates, each resulting 
element of C is then determined by the choice of block indices (/ii, li), . . ., {hp, Ip) and is the sum of ■ ■ ■ e^-p 
terms Chi,ii,...,hp,ij,,comp, where : = (/iq - l)fc + Each term Chi,h,...Mp,ip,comp is a product of an element 
of A and an element of B, an element of where /i is at most 1 + max,._s(as + A + 7r + 3) log^, n, and 
logj, n elements of U, of V and of W. Using the maximum-entry norm ||M|| :=maxij |mij| for a matrix M, 
we therefore arrive at the bound 

\\C,omp - C|t < (1 + max(a, + /3, + 7, + 3) log, n) ■ (emax • \\U\\ \\V\\ \\W\\y°^- "||A|| ||B||£ + 0{e^). (8) 

We can now summarize this formally as a theorem. 

Theorem 3.1. A bilinear non- commutative algorithm for matrix multiplication based on stationary parti- 
tioning is stable. It satisfies the error bound where \\ ■ \\ is the maximum-entry norm and where 

H{n) = (1 + max(a, + + 7^ + 3) log^^ n) ■ (emax • \\U\\ \\V\\ 

r.s 

Remark 3.2. Note that in TheoremO ^i{n) = O („i°gfc(™ax-|H/|| ||iv||)+o(i) j _ r^j^-g confirms our state- 
ment, following equation that the term /i(n) in our error bound is polynomial in n. 

Theorem 3.3. For every rj > there exists an algorithm for multiplying n-by-n matrices which performs 
0(n'^+''') operations (where co is the exponent of matrix multiplication) and which is numerically stable, in 
the sense that it satisfies the error bound ^ with /x(n) = O(n^) for some constant c depending on rj but not 
n. 

Proof. It is known that the exponent of matrix multiplication is achieved by bilinear non-commutative 
algorithms |17j . More precisely, using the terminology of TT , for any arithmetic circuit of size S which 
computes the product of two input matrices A, B over a field of characteristic zero, there is another arithmetic 
circuit of size 0(5*) which also computes the product of A and B and is a bilinear circuit, meaning that it has 
the following structure. There are two subcircuits Ci, C2, whose outputs are linear functions of the entries 
aij (resp. hij). Then there is one layer of product gates, each of which multiplies one output of Ci with 
one output of C2. Then there is a subcircuit C3 whose inputs are the outputs of these product gates, and 
whose outputs are the entries of the matrix product. The only operations performed inside subcircuits Ci, 
C2, C3 are addition and scalar multiplication. Every bilinear circuit corresponds to a bilinear noncommutative 
algorithm as expressed in ([5]) the outputs of Ci, C2 are the linear forms \^UisXi\, {^Vjsyj}, respectively. 
The product gates compute the numbers Pg. The circuit C3 computes the linear forms ^ WrsPs- 

By the definition of iv, for some constant C there exist arithmetic circuits of size less than Ck^^^^"^ 
which compute a, k x k matrix product, for every k. By the preceding paragraph, we can assume that these 
circuits are bilinear circuits. This means that for every k there exists a bilinear noncommutative algorithm 
for k X k matrix multiplication using t < Cfc"+''/^ non-scalar multiplications. Choose fco large enough 
that C/cq^''^^ < fco^''. Using the bilinear non-commutative algorithm for this value of fcg and applying 
Theorem 13. 11 we obtain the theorem. □ 

3.2 Non-stationary partition algorithms 

The analysis from the preceding section generalizes easily to bilinear matrix multiplication algorithms based 
on non-stationary partitioning. In that case, the matrices A^Jjcomp and B'^i}comp are partitioned into kxk 
square blocks, but k depends on the level of recursion, i.e., k ~ k{j), and the corresponding matrices U, V 
and W also depend on j: U — U{j), V = V{j), W — W{j). Otherwise the algorithm proceeds exactly like 
in the previous section. Suppose such an algorithm applied to nxn matrices requires p levels of recursion, 
so that 11^=1 ^U) ~ For each level j, we can define quantities ctsij) analogously to as, Ps{.j) analogously 
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to einax(j) analogously to emax. We then use the same reasoning as above to obtain the following error 
bound for non-stationary partition algorithms: 



Ccomp-C\\ < (1 + Vmax(a,(j)+/3,0')+7^(j)+3)) 

^ r.s 



r,s 
3 



X j^nemax(j)||t/(j)ll WViM \\W{j)\\j \\A\\ \\B\\e + 0{e'). (9) 

Theorem 3.4. A bilinear non- commutative algorithm for matrix multiplication based on non- stationary 
partitioning is stable. It satisfies the error bound ^ where \\ ■ \\ is the maximum- entry norm and where 

^i{n) = (l + ^max(a,(j)+A(j)+7.(j)+3)) j I]emaxO-)l|;^(j)ll ll^j)!! Wm 

3.3 Algorithms that combine partitioning with pre- and post-processing 

Finally consider algorithms that combine recursive non-stationary partitioning with pre- and post-processing 
given by linear maps Pre„() and Post„() acting on matrices of an arbitrary order n. More specifically, 
the matrices A and B are each pre-processed, then partitioned into blocks, respective pairs of blocks are 
multiplied recursively and assembled into a large matrix, which is then post-processed to obtain the resulting 
matrix C (see Section 14.21 for concrete examples of pre- and processing operators) . 

We assume again that the partitioning is non-stationary, i.e., that at level j of the recursion all matrices 
are of order nj ■=Y[i>j ^(0 ^^id are partitioned into tj blocks of size Uj^i. (At the lowest pth level of 
recursion, rip = 1, while at the top level of recursion, ni — n, i.e., 0^=1 ^U) — 

For this analysis, we will be working with a consistent norm || • || defined for matrices of all sizes and 
satisfying the condition 

max||Af,|| < ||Af|| < ^ ||M,|| (10) 

s 

whenever the matrix M is partitioned into blocks {Ms)s (an example of such a norm is provided by || • II2). 
Note that the previously used maximum-entry norm satisfies pop but is not consistent, i.e., fails to satisfy 

\\AB\\ < \\A\\ ■ \\B\\ for all A, B. 

We denote the norms of pre- and post- processing maps subordinate to the norm || • || by || • \\op. Suppose 
that the pre- and post-processing is performed with errors 

||PRE„(M)eomp-PRE„(Af)|| < fprein)e\\M \\+0{e^) , 1 1 PosT„ ( A/),o„p -PosT„ (M) || < fpostin)e\\M\\+0{e^), 

where n is the order of the matrix M . As before, we denote by iJ-{n) the coefficient of e in the final error 
bound dH). 

The function /i can be found recursively as follows. Consider one level of recurrence where matrices of 
order n are partitioned into, say, t matrices of order n/k. Denote the matrix Pre„(A) by A and PRE„(i3) 
by B. The computed matrix Acomp {Bcomp, resp.) is within /pre("-)e||^|| {fpre{n')s\\B\\, resp.) from A (B, 
resp.). The matrices A and B are further partitioned, which does not introduce additional errors. Thus 

\\As,comp -M< fpre{n)e\\A\\ + 0{e^), ||4,co,„p -Bs\\< fpre{n)e\\B\\ + 0(£2). 

The blocks Ag^comp and Bg comp are then multiplied recursively, which, for each pair of blocks, introduces an 
error of size fJ,{n/k)e\\As^comp\\ \\Bs,comp\\- Denoting the computed products by Cg^comp, we thus obtain 

l|C*s^comp Ag^QQjjipBg^f^QjYipW /^(^/^)^^ II ^s, comp 1 1 ||-^s, compH 0(^8 ) 

< fi{n/k)e\\As\\\\Bs\\ + 0{e^) 

< f,{n/k)e\\A\\\\B\\+0{e') 

< f,{n/k)e\\PBBjlp\\ A\\\\ B\\+Oie^). 
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We now apply the triangle inequality to evaluate HCs, comp — Rewriting Ag as As^comp+ [As ~ As.comp) 
and Bs as Bs,comp + {Bg - Bs^comp), we get 

||C's^COmp — ||C's^COT7ip Ag QQJJipBg QQJJipW \\Ag QQjyipBg QQJYip J^gi^gH 

< fi{n/k)e\\PREjlp \\A\\ \\B\\ + 2fpre{n)e\\A\\ \\B\\ + 0{e'). 

Summing up over all s = 1, . . . ,t and taking into account the assumed properties of the norm || • ||, we 
therefore obtain 

WCcomp '~C\\<J2 W^s^cornp ' Cs\\ < t (^(n/fc)e || Pre„ 1 1 \\A\\ \\B\\ + 2fpre{n)e\\A\\ \\B\\) + 0{e^). 

s 

Finally, the post-processing step rescales the obtained error by the norm ||Post„|| op and adds another error 
term of order 

fpostin)\\Cco^p\\e < fpost{n)\\PiiEn\\lp \\A\\ \\B\\ + 0(6^). 
Altogether, this gives the recurrence 

/Lt(n) = ^l{n/k)t\\POSTn\\op ||PRE„||op + 2fpre{n)t\\POST,i\\op + fpost{n)\\PREn\\lp. 

The same argument is applicable to each level j of recurrence, with n replaced by rij, n/k replaced by rij+i, 
and t replaced by tj . We now state this formally as a theorem. 

Theorem 3.5. Under the assumptions of this section, a recursive matrix multiplication algorithm based on 
non- stationary partitioning with pre- and post-processing is stable. It satisfies the error bound with the 
function fi satisfying the recursion 

HiUj) = ^(nj + i)tj||POST„. Hop ||PRE„^ Hop + 2fpre{nj)tj\\POSTn, \\op + /post ("j ) 1 1 PREn^ Hop , J = 1, ■ • ■ 

4 Group-theoretic recursive algorithms 

To perform the error analysis of the group-theoretic matrix multiplication algorithms defined in [9] and [7] , 
we must first recall some definitions and facts about those algorithms. The relevant material is reviewed in 
Section 14. f I In Section 14.21 we define the class of group-theoretic algorithms — called abelian simultaneous 
triple product (abelian STP) algorithms — and we introduce a running example (i.e. a specific algorithm 
in this class) for the purpose of concreteness. This class of algorithms encompasses all of the fast matrix 
multiplication algorithms described in [7], and is a special case of the group-theoretic algorithms defined 
in [3]. We refer the reader to [5] for a proof of the correctness of these algorithms, as well as an analysis 
of their running time. In Section 14.31 we will apply the analysis from Section [3] to derive error bounds for 
abelian STP algorithms. In Section we will cite some specific examples of such algorithms and evaluate 
their error bounds. 

4.1 Background material 

We begin by recalling some basic definitions from algebra. 

Definition 4.1 (semidirect product). If H is any group and Q is a group which acts (on the left) by 
automorphisms of H, with q ■ h denoting the action of q G Q on G iJ, then the semidirect product H ><\ Q 
is the set of ordered pairs {h, q) with the multiplication law 

{hi,qi){h2,q2) = {hi{qi ■ /i2),gi'72)- (11) 

We will identify H x {1q} with H and {Ih} x Q with Q, so that an element {h,q) G i? x Q niay also be 
denoted simply by hq. Note that the multiplication law of TJ x Q implies the relation qh — {q ■ h)q. 
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Definition 4.2 (wreath product). If H is any group, S is any finite set, and Q is a group with a left action 
on S, the wreath product H I Q is the semidirect product (H^) xi Q where Q acts on the direct product of 
5*1 copies of H by permuting the coordinates according to the action of Q on S. (To be more precise about 
of the action of Q on if an element h e is represented as a function h : S ^ H, then q ■ h represents 
the function s h(q^^{s)).) 

Example 4.3 (running example, part 1). Throughout this section, we will work with a running example 
of an abelian STP algorithm based on a specific finite abelian group H with 4096 elements, and its wreath 
product with a two-element group. Consider the set S':={0, 1} and a two-element group Q whose non- 
identity element acts on S by swapping and 1. Let H be the group (Z/16)^ whose elements are ordered 
triples of integers {xq, Xi, X2) modulo 16. An element of is an ordered pair of elements of H, which can 
be represented as a 2-by-3 matrix 

/ a;oo xoi X02 \ 

\ Xio Xii X12 J 

of integers modulo 16. An element of H } Q is an ordered pair {X,q) where A" is a matrix as above, and 
q = ±1. An example of the multiplication operation in H I Q is given by the formula for (A, — 1) • (Y, —1): 

xoo xoi xq2 \ _A f f yoo Vol V02 \ ^ ( ( 2^00 + yio xoi^-yii X02 + yu \ A 

VV 2:10 xii X12 ) ' ) \\ Vw Vii V12 ) ' ) VV 2^10 + yoo xii+ym X12 + ym )/' 

Notice that the rows of Y were swapped before adding it to A. 

An alternative description of _ff ; Q is that it has generators ao, ai, 60, 61, cq, Ci satisfying the following 
relations: 

1. flo, ai, 60, bi, Co, ci collectively generate the group (Z/16Z)^. 

2. q^ is the identity element. 

3. qao = aiq, qbo = hq, qco = ciq. 

Example 4.4. This example generalizes the preceding one. When S is the set {l,2,...,n}, we use the 
notation Sym„ to denote the group of all permutations of S, acting on S in the obvious way, i.e. tt • s = 7r(s). 
Each element of the wreath product H I Sym„ may be uniquely represented as a product hn where tt S Sym„ 
and h = (hi, /i2, . . . , hn) € i?". The multiplication law of H I Sym„ is given by the formula: 

{h7r){h'Tr') = h{TT ■ h')TnT' = (hih'^^i(^i),h2h'^-,^^^, . . . , /i„/i;-i(„)) tttt' . (12) 



Next we recall some definitions and theorems from 9 and If S*, T are subsets of a group G, we use 
the notation Q{S,T) to denote their right quotient set, i.e. 

Q{S,T):^{st-^ : s£S,teT}. 

We use the notation Q{S) as shorthand for Q(S, S). 

Definition 4.5 (triple product property, simultaneous triple product property). If is a group and A, Y, Z 
are three subsets, we say A, Y, Z satisfy the triple product property if it is the case that for all qx E Q(A), qy G 
Q(X),qz G QiZ), if qxqyqz = 1 then q^ = qy = q^ = 1. 

If {{Xi,Yi, Zi) : I £ J} is a collection of ordered triples of subsets of H, we say that this collection 
satisfies the simultaneous triple product property (STPP) if it is the case that for all A; G / and all 
qx e Q{Xi,Xj),qy e Q{Yj,Yk), qz e Q{Zk, Zi), if qxqyqz = 1 then qx = qy = qz = ^ and i = j = k. 
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Example 4.6 (running example, part 2). In our rmining example, the group H is (Z/16Z)'^. Consider the 
following three subgroups of H . 

X := (Z/16Z) X {0} X {0} 
Y := {0} X (Z/16Z) X {0} 
Z := {0} X {0} X (Z/16Z) 

We claim that X, y, Z satisfy the triple product property. Since H is an abelian group, we will denote the 
group operation and the identity element using additive notation rather than multiplicative notation. Thus 
the triple product property is the assertion that if G Q{X), qy £ Q{Y), G Q{Z), and qx + Qy + Qz = 0, 
then qx — qy ^ Qz = 0. Note first that Q{X) = X,Q{Y) = Y,Q{Z) = Z because X,Y,Z are subgroups. 
The elements qy, qz have in their first component, so the first component of qx + Qy + Qz is equal to the 
first component of qx- This shows that the first component of is 0, which implies that qx = 0- By similar 
arguments, qy = and Qz — 0, which confirms the triple product property. 
Now consider the following six subsets of H: 

Xo :={!, 2, . . . , 15} X {0} x {0} Xi :={0} x {1, 2, . . . , 15} x {0} 

Yo :={0} X {I, 2, ... , 15} x {0} Fi :={0} x {0} x {I, 2, . . . , 15} 

Zo :={0} X {0} X {1, 2, ... , 15} Zi :={1, 2, . . . , 15} x {0} x {0} 

We claim that {Xq, Yq, Zq) and {Xi, Yi, Zi) satisfy the simultaneous triple product property. Suppose that 
i, j, k e {0, 1} and Qx G Q{X^, Xj), qy £ Q{Yj,Yk)^qz_€ Q{Zk, Zi). Suppose moreover that qx + Qy + qz = 0. 
If i = = fc, then we may argue as before that Xi,Yi,Zi satisfy the triple product property and therefore 
Qx = Qy = Qz = as desired. If k are not all equal, we may perform a case analysis for each of the six 
possible ordered triples {i,j,k), in each case obtaining a conclusion which contradicts the assumption that 
Qx + Qy + Qz = 0. We illustrate this by considering the case i = j = 0, fc = 1. In this case qx G Q^Xq^Xi^) 
has zero in its second component, as does qz G Q{Zi, Zq). But qy G Q{Yo, Yi) has a nonzero element in its 
second component. Thus the second component of qx + Qy + Qz is nonzero, contradicting our assumption 
that qx + qy + qz ^ 0. 

Lemma 4.7. // a group H has subsets {Xi,Yi, Zi : 1 < i < n} satisfying the simultaneous triple product 
property, then for every element hi: in H I Sym„ there is at most one way to represent Hit as a quotient 
{xa)~^yT such that x G nr=i -^i^ V ^ 11"=! -^i' Sym„. 
Proof. Let X HLi ^ — 117=1 Suppose that 

{xa)-^yT = [x'a'y^y'r' (13) 
and that x,x' G X, y,y' G Y, a,cr' ,t,t' G Sym„. We have 

{xa')^^yT — a^^x^^yr = (c^"'" • {x~^y)'j cr^'^r, 

and the right side of p3p may be expressed by a similar formula. Equating terms on both sides, we find 
that: 

^ait)y'^{^) = (^'a'{i)r^ya'(i) for 1 < i < n (14) 
(j-V = {a')-W. (15) 
Let j = a{i), k = cr'(i). We may rewrite ([T^ as 

4(^.)"'%(yfe)"' = 1- (16) 

The left side of (HI]) has the form qxqyqz where qx G Q{Xk,Xj), qy G Q(Yj,Yk), = 1 G Q{Zk,Zk). By 
the simultaneous triple product property, we may conclude that j — k, x — x' , y = y'. Recalling that 
j — a{i), k — <y'{i), we have aii) — cr'(i), and as i was an arbitrary element of {1, 2, . . . ,n} we conclude that 
a = a' . Combining this with (|15p implies that t = t' . Thus x = x' , y = y' , a — a' , t — t' , as desired. □ 
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Finally, we must recall some basic facts about the discrete Fourier transform of an abelian group. If H is 
an abelian group, we let H denote the set of all homomorphisms from H to , the multiplicative group of 
complex numbers with unit modulus. Elements of H are called characters and will be denoted in this paper 
by the letter The sets H, H have the same cardinality. When Hi , H2 are two abelian groups, there is a 
canonical bijection between the sets Hi x H2 and [Hi x H2)^; this bijection maps an ordered pair (xi,X2) 
to the character x given by the formula x(ft.i, = Xi(/''i)X2(^2). Just as the symmetric group Sym„ acts 
on i/" via the formula a ■ {hi, /12, . . . , hn) — {ha-^{i), ^'cr-i(2)j ■ • • j ^cr-i(n))j there is a left action of Sym„ 
on the set H" defined by the formula a ■ (xi,X2, ■ • ■ ,Xn) = {Xa-Hi),Xa-H2), ■ ■ -iXa-^n))- In the following 
section we will use the notation S(i?") to denote a subset of H" containing exactly one representative of 
each orbit of the Sym„ action on iJ". An orbit of this action is uniquely determined by a multiset consisting 
of n characters of H, so the cardinality of is equal to the number of such multisets, i.e. ('^'^^^). 

Example 4.8 (running example, part 3). A character x of the group H ~ (Z/16Z)'^ is uniquely determined 
by a triple (ai, 02, as) of integers modulo 16. For an element h — (61, 62, ^3) G H , we have 

X{h) — g2'^i(aif'i+'^262+a3b3)/16 

A character of the group H^ is a pair of ordered triples which may be represented as the rows of a matrix 

ail ai2 ai3 

0^21 0,22 0,23 

as before. The group Sym2 = {±1} acts on H^ by exchanging the two rows of such a matrix. An orbit of 
this action is either: 

• two distinct matrices, each obtained from the other by swapping the top and bottom rows; or 

• one matrix whose top and bottom rows are identical. 

There are 4096 rows that can be formed from an ordered triple of integers modulo 16, so there are (^"^j^^) 
orbits of the first type and 4096 orbits of the second type. Thus the set has cardinality 



4096 = 8,390,656. 



CD 

4.2 Abelian STP algorithms 

This section is based on the material from [8]. 

Definition 4.9 (abelian STP family). An abelian STP family with growth parameters {a, (3) is a collection 
of ordered triples {Hn,Tn, fc^y), defined for all > 0, satisfying 

1. Hn is an abelian group. 

2. Tjy — {Xi,Yi, Zi : i — 1,2, ... , N} is a collection of N ordered triples of subsets of Hn satisfying the 
simultaneous triple product property. 

3. \Hn\ = N'^+°^^\ 

4. kN = nti i^.i = nil \y^\ - nf=i \z^\ - n^^+''(^\ 

Remark 4.10. If {(-ffAr, Tjv, /cat)} is an abelian STP family, then Lemma [4. 71 ensures that there is a one- 
to-one mapping 



/ w \ / N \ 
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given by {x,y,(7,T) ^ (xcr) ^yr. The fact that the mapping is one-to-one imphes the first line in the 
following scries of inequalities. 

\HNfN\ > {kNNlf 

jyaN+o{N)jyN+o{N) > j^2l3N+o{N) j^2N+o{N) 

a + 1 > 2/3 + 2 

a — 1 a + I 

> > 2. 

P - (3 + 1 - 



Example 4.11 (running example, part 4). Example 14.81 contained an example of a group H with 4096 
elements which contained two triples of subsets, {Xq, Yq, Zq) and {Xi, Yi, Zi), satisfying the simultaneous 
triple product property. Each of the sets Xi, Yi, Zi {i ~ 0, 1) has 15 elements. 

We will now show how to extend this example to an abelian STP family. For > 1 let ^ = [log2(A^)] 
and let = H^. For I < i < N lei ii, 12, ■■■ ,ii denote the binary digits of the number i — 1 (padded with 
initial O's so that it has exactly t digits) and let 

ill 
Xi '■— JJ^ , Yi '■= W Yi^, Zi :— Y[ Zi^. 

rn—l m—l m— 1 

The triples {Xi, Yi, Zi) satisfy the simultaneous triple product property. Indeed, if k £ {1, 2, . . . , N} and 
qx e Q{Xi, Xj), qy £ Q{Yj,Yk), qz G Q{Zk, Zi),qx + + = 0, then for to = 1, 2, . . . , ^ it must be the case 
that the TO-th components of qx,qy, qz satisfy {qx)m + (%)m + (<?z)m — 0. Using this equation and applying 
the fact that {Xf),Yf), Zq) and {Xi,Y i, Zi) satisfy the simultaneous triple product property, we find that 
im = jm = km and that {qx)m = {%)m = {(lz)'m = 0. Since this holds for m = 1,2, ... it follows that 
i — j — k and qx — qy = qz = as claimed. 

Finally, we may work out the growth parameters of this abelian STP family. We have 

\Hn\ = = (163)l+Llog2(JV)J ^ ^31og2(16)+0(l/logAr)^ 

hence a = 31og2(16) — 12. Also, 

N N e 

fcAT = n l^^l = n n l^^-l = = 15^1og.(W)+0(iV) ^ ^JVlog,(15)+0(iV/log7V)^ 

i—1 i—1 m—l 

hence /3 = log2(15). 



Given an abelian STP family, one may define a recursive matrix multiplication which we now describe. 
Given a pair of n-by-n matrices A, B, we first find the minimum N such that fcjv • N\ > n, and we denote the 
group by H. If iV! > n, then we multiply the matrices using an arbitrary algorithm. (This is the base 
of the recursion.) Otherwise we will reduce the problem of computing the matrix product AB to ('^'"'jsf^"^) 
instances of A^! x A^! matrix multiplication, using a reduction based on the discrete Fourier transform of the 
abelian group . We next describe this reduction. 

Padding the matrices with additional rows and columns of O's if necessary, we may assume without loss 
of generality that ■ Nl — n. Define subsets X,Y, Z C Hi Sym^ as follows: 

X:^ X Sym^, F:- (jl^") ^ ^Jl^") 

These subsets satisfy the triple product property [3. Note that \X\ = |y| = \Z\ ~ n. We will treat the rows 
and columns of A as being indexed by the sets X, Y, respectively. We will treat the rows and columns of B 
as being indexed by the sets Y, Z, respectively. 
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The algorithm makes use of two auxUiary vector spaces <C[H I Sym^],C[i?^ x Sym^y], each of dimen- 
sionaHty \H\^N\ and each having a basis which we now designate. The basis for C[i? jSym^] is denoted by 
{sg : g € H I Sym^}, and the basis for C[H^ x Sym^] is denoted by {e^^^r : X ^ , a e Sym^Y}- 

The abeUan STP algorithm performs the following series of steps. We have labeled the steps according 
to whether they perform arithmetic or not. (For example, a step which permutes the components of a vector 
does not perform arithmetic.) 

1. Embedding (no arithmetic): Compute the following pair of vectors in C[H } Symjy]. 

xex yeY 
b := "^"^By^ey-i^. 

y£Y zeZ 

2. Fourier transform (arithmetic): Compute the following pair of vectors in C[i7^ xi Sym^]. 



X6 

b := 



E E f E Xih)b.') 



3. Assemble matrices (no arithmetic): For every x ^ ^{H^), compute the following pair of 
matrices A^, B^, whose rows and columns are indexed by elements of Symjy. 



4. Multiply matrices (arithmetic) : For every x G ^{H^), compute the matrix product := A^B^ 
by recursively applying the abelian STP algorithm. 

5. Disassemble matrices (no arithmetic) : Compute a vector c := a ^x.cr^x.o' C[i/^ xi Sym^] 
whose components c^^a are defined as follows. Given x, ct, let xo G E,{H^) and t G Sym^ be such that 
X = T-Xo- Let 

6. Inverse Fourier transform (arithmetic): Compute the following vector c £ C[iJ I Sym^]. 



-=E E 

/leff" o-eSym, 



7. Output (no arithmetic): Output the matrix C — {Cxz) whose entries are given by the formula 



See [8] for a proof of the algorithm's correctness. 
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Example 4.12 (running example, part 5). In our example with H = (Z/16Z)^ and iV = 2, we have 
k^Nl — (15^) (2!) = 450, so the seven steps outlined above constitute a reduction from 450-by-450 matrix 
multiplication to a large number of 2-by-2 matrix multiplication problems, i.e. of them. We will 

elaborate on the details of this reduction in the following paragraph. Recall from Example l4.8l that |E!(i?^)| — 
8,390,656. By comparison, the naive reduction from 450-by-450 to 2-by-2 matrix multiplication — by 
partitioning each matrix into (225)^ square blocks of size 2-by-2 — requires the algorithm to compute 
(225)'^ = 11,390,625 smaller matrix products. If we use this more efficient 450-by-450 matrix multiplication 
algorithm as the recursive step in a stationary partition algorithm as in Section [3. 1[ the running time would 
be 0{n^'^^). Instead, if we use the N = 2, H = (Z/16Z)^ construction as the basis of an abclian STP family 
as in Example I4.11[ we may apply the abelian STP algorithm which uses a more sophisticated recursion 
as the size of the matrices grows to infinity. For example, when N = 2^, we have n = fc^vA^! = 15^^(2^)!. 
The first step of the stationary partition algorithm would reduce an n-by-n matrix multiplication problem 
to a set of (n/450)-by-(n/450) matrix multiplication problems. By comparison, the first three steps of the 
abelian STP algorithm reduce n-by-n matrix multiplication to a set of (A'^!)-by-(A'^!) matrix multiplications. 
As N\ = 0{n'^'^^) in this example, we see that the abelian STP algorithm achieves a much more significant 
reduction in the size of the matrices at the top level of recursion. For the abelian STP algorithm in our 
running example, it can be shown that the running time is 0(n^'^^). 

We will now go into greater detail in explaining the abelian STP algorithm in the case N = 2, H = 
(Z/16Z)'^ given in our running example. In this case, H I Sym2 is the wreath product group described in 
Example 14.31 its elements are represented by ordered pairs (M, q) where M is a 2-by-3 matrix of integers 
modulo 16 and q = ±1. The sets X,Y, Z C H I Symj can be represented as follows: 

By this, we mean that an element of X is an ordered pair (Af , q) where M contains nonzero numbers in the 
upper right and lower middle entries, and zero in every other entry, and q is in {±1}. The interpretation of 
the expressions for Y and Z is analogous. 

The first three steps of the algorithm perform preprocessing on the matrix A to arrange some linear 
combinations of its entries into a set of 2-by-2 matrices, one for each element of Likewise, they 

preprocess the matrix B to arrange linear combinations of its entries into a set of 2-by-2 matrices. We 
will describe the preprocessing of A; the preprocessing of B is entirely analogous, but uses the subsets 
Y,ZC HlSym2 in place of X, Y. The group iJi Symj can be partitioned into two subsets of size = 16^, 
namely the elements (M, g) whose second component is +1 and those whose second component is —1. (We 
will call these the positive and negative subsets.) The first step in the preprocessing of A inserts its entries into 
two 6-dimensional arrays a^, a_ of size 16^, which we call the positive and negative arrays, corresponding 
to the positive and negative subsets of H I Sym2. For example, the matrix A contains an entry in row 
9 \ ,A , , 11 

5 y ' 
we may compute that 



Q g Q , , 1 I and column J'^llg g 4)'^)' because x & X and y & Y. In. H I Sym2 



-1 // 11 \ / 5 \ A // 6 \ ^ 

This tells us that the entry A^y in row x, column y of A should be inserted into a_ (because the second 
component of x~^y is -1) at the location whose index in the 6-dimensional array is the 6-tuple (0, 6, 0, 7, 0, 4). 
The locations of the other entries of A are determined similarly. At the end of this step, some of the entries 
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of the positive and negative arrays will not have been filled with an entry of A; the algorithm writes in 
these entries of the positive and negative arrays. 

The second step in the preprocessing of A performs a 6-dimensional discrete Fourier transform on the 
positive and negative arrays. That is, we compute the array a+ whose entries are given by the formula: 

15 15 15 15 15 15 / 2Tri * \ 

a+(il,«2,»3,i4, j5,«6) = X! X! X! X! X! X! ( 'iQ^^''^'' ) °-ihj2j3jtj5j6)- 

Ji=0j2=0i3=0i4=0j5=0i6=0 \ k=i ) 

This may be computed using the fast Fourier transform. An array a_ is defined similarly, using the entries 
of the negative array instead of the positive array. 

The third step in the preprocessing of A forms a 2-by-2 matrix Ai^ for each element x ^ '^{H'^)- The 
formula is given in step [3] above. Recall that an element of 'E.{H'^) can be represented by a 2-by-3 matrix 
of integers modulo 16, subject to the condition that if two such matrices differ only by swapping the top 
and bottom rows, then exactly one of them belongs to For notational convenience, we will write the 

entries of a matrix ( *^ ) as a 6-tuple (ii, 22, ?3, ?4, is, le)- If X = (*ii *2, *3, *4, *5, ^e) then A-^ is the 
V «4 «5 «6 / 

matrix 

f a+{ii,i2,iz,iA,ib,i&) a-(ji, ^2, i3, «4, is, ^e) \ 

\ a_(i4,Z5, 16,^1, Z2,i3) a+(i4,»5,«6,«l,i2,«3) / 

Note that if ii = i2 = *5; *3 = ^6 then this matrix contains only two distinct numbers, each repeated twice. 
The preprocessing of B is performed similarly, resulting in matrices for each x ^ '^i^'^)- The algorithm 
then computes each matrix product = A^B^. 

Finally, there is a three-step postprocessing phase which reconstructs the entries of the matrix product 
C = AB by taking linear combinations of the entries of the matrices C^. The first step is to arrange 
the entries of the matrices into a pair of arrays c+, c_ by reversing the mapping which was used to 
assemble the entries of a+, a_ into the matrices A^. Thus, for a 6-tuple x -—{ii, *2, *3, *4, ^5, ie), if X S "(^^) 
then c+(i),c_(i) are the entries of the first row of and if i ^ 'E.{H^) then c_(i),c+(i) are the entries of 
the second row of where x' :=(*4j ^5, *6i ^ii ^2, *3)- Having constructed the arrays c+, c_, we perform an 
inverse Fourier transform to obtain arrays 0+ , c_ . Finally, to determine the entry Cxz of the product matrix 
C — AB, we compute the element x~^z in the wreath product 7J ;Sym2, select the array c+ or c_ according 
to whether the second component of x~^z is +1 or —1, and look up the entry in this array whose index is 
the 6-tuple given by the entries of the matrix which forms the first component of x~^z. 

4.3 Analysis of abelian STP algorithms 

Now we are in a position to analyze abelian STP algorithms. We could have done that using Theorem 13.51 
However, we choose to further refine our error analysis to obtain sharper norm inequalities for a specific 
matrix norm and hence better error bounds. 

Theorem 4.13. // {(iJ^r, Tjv, ^jv)} is an abelian STP family with growth parameters {a,f3), then the corre- 
sponding abelian STP algorithm is stable. It satisfies the error bound with the Frobenius norm and the 
function ji of order 

Proof. We seek to establish that the matrix C'comp computed by the algorithm differs from the actual matrix 
product C by at most /i(n)£|l AHfHSHf + O(e^) in Frobenius norm, i.e. 

\\C,omp - C\\f < fi{n)e\\AMB\\F + 0{e^). 

Throughout this proof we will adopt the convention that the Fourier transform of an abelian group H is 
represented by a matrix F satisfying FF~^ — \H\I, rather than a unitary matrix. This is consistent with the 
interpretation that the Fourier transform takes an element x G C[i/], represented as a linear combination of 
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basis elements h H, and returns the coefficients in the unique representation of x as a linear combination 
^x^X' where the elements are idempotent in C[-ff]. When the Fourier transform is instead normalized 
so that it is represented by a unitary matrix, we will refer to this linear transformation as the "unitary Fourier 
transform." Let f{n) be the L2 error bound satisfied by the unitary Fourier transform and its inverse, i.e. if 
H is an abelian group with n elements, a; is a vector in C[-ff], and x, x are its unitary Fourier transform and 
inverse Fourier transform, then Hicomp — ^Ib and \\xcomp — x\\2 are both bounded by /(n)£||a;||2 + 0(6^). 

An abelian STP algorithm satisfies a Frobenius-norm error bound of the form /Lt(n), where the function 
fj.{n) satisfies a recursion which is governed by the recursive structure of the algorithm itself. Specifically, the 
algorithm break down into a series of seven steps specified in Section 321 Observe that arithmetic operations 
are performed only in the even-numbered steps. The odd-numbered steps consist only of rearranging (and 
possibly repeating) the components of a vector to form the entries of a set of matrices and vice-versa. (To 
see that no arithmetic is performed in Step [Tl use Lemma 14.71 which implies that each component of the 
vectors a, 6 is a sum of either zero or one entry of one of the matrices A,B.) 

Step [1] replaces the matrix A with a vector whose 2- norm is equal to the Frobenius norm ||A||f, and 
similarly for B. Step [2] performs Nl copies of the discrete Fourier transform of the group . We have 

\\&c.omp ^&\\l< m'^fUHn's^AlW' + 0{e') (17) 

and similarly for B. (The extra factor of on the right side comes from the fact that we're using a 

Fourier transform which is a unitary matrix multiplied by the scalar |i/|^/^.) 

Step [3] doesn't perform any arithmetic, but it repeats each component of a (resp. b) possibly A^! times 
in assembling a set of matrices {A^} {resp.{B^}). Let Err^ ^iomp ~ We have 

EP^IIf' < N\\H\^\\A\W' (18) 
EllErrlllp' < m\H\''fi\H\''fe'\\A\W' + Oie'). (19) 

The matrices Err^ are defined similarly, and they satisfy a similar bound on the sum of their squared 
Frobenius norms. 

Step |4] multiplies each pair A^^^^, B^omp to obtain a matrix C^omp- The error matrix 

can be expressed as a sum of four terms, as follows: 

-PyyX _ (f<x _ /IX ux ^ I (AX f>x — r ) 

C y^comp comp comp/ ~ \ comp comp X/ 

- (C,^„„p - A?_^BX_p) + [{A^ + Err^)(S'^ + Err|) - A^B^] 
= {C^,„,p - A^^^pB^,^^) + Err^S^ + A^Err^ + Err^Err^ . 



The fourth term is of order 0{e'^) and may be ignored. The remaining terms may be dealt with as follows. 
First, by the inductive hypothesis: 



IC^omp-A^ompB^omplW < /i(A^!)ePLnpl|F||S?o™pl|F + ©(e^) 

f,{m)e\\A^F\\B^F + 0{e^). 



Next, 



lErr^B^llF < ||Err^||F||i?^||F 
I^^Err^llF < px||F||Err|||F. 
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Summing all of these bounds over % S S, we obtain 

5^||Err^||F < M^!)eEll^''IIH|5'^llF + El|Err^llF|l^''llF + Ell^''llF||Er^BllF + 0(6^) 



X 

1/2 / ^ 1/2 



1/2 , ^ 1/2 

EllErr^llp') (Ell^'" ' 



?XI|F 



/ \l/2 / X 1/2 



< A^(iV!)7V!|HreP||F||i?||F + 2/(|i/r)7V!|i/re||A||F||S||F + 0(e2 
EllErr^llp') < [2f{\Hn+f,{m)]m\H\^s\\AMB\W + 0{e'). 



1/2 

2 



The second line was derived from the first by applying Cauchy-Schwarz three times. The third line was 
derived using (fT8|) and (fT9| . The final line was derived using the inequality ||a;||2 < ||a:||i, applied to the 
vector whose components are (||Err^l|F)xeH- 

In Step El we form a vector Ccomp in Fourier space whose components are a subset of the entries of the 

/ 2\ 1/2 

matrices C^omp- Om' uppcr bound on f H^'i'i'x IIf ) remains a valid upper bound on ||ccomp — c\\2- 

In StepIH we apply the inverse Fourier transform to Ccomp, to obtain a vector Ccomp- The inverse Fourier 
transform performed in this step is a unitary Fourier transform multiplied by the scalar so 

\\ccomp-ch < \Hr''^^f{\H\'')e\\cco^ph + \H\-''^^\\cco^p-ch + 0{e^) 

< \H\-''^'f{\Hns\\ch + [2f{\Hn + /i(7V!)] Nl\H\''/'e\\AMB\\F + 0{e') 

< f{\H\'')e\\AMBy + [2f{\Hn + ^^{Nl)] Nl\H\''^'s\\AMBy + 0{e'). 

The second line was derived from the first by observing that ||ccomp||2 < ||c||2 + 0(e) and by substituting our 
earlier bound for ||ccomp — c||2- The third line was derived by using the bound ||c||2 < |-ff|^/^l|Al|Fl|i?||F, which 
follows from the fact that c is the Fourier transform of the vector c, whose L2-norm is ||C||f < II^I|f||S||f- 
(Recall that the Fourier transform increases L2 norms of vectors by a factor of \H\^/^.) 

In Step [21 no further error is introduced. Thus, the matrix Ccomp — C has Frobenius norm bounded by 

WCcomp - C||f < [/(Iff 1"^) + 2m\H\'"^f{\Hf) + m\Hf'^ti{N\)\ ePIlFllBllF + 0{e^). (20) 

Let m = Nl, and recall that n = 771/5+1+0(1)^ \H\^ — to"^°^^^ The error bound (PO)) leads to the recursion 

^(^/3+l+o(l)) < rnl+"/2+o(l)/(™"+o(l)) +„jl+"/2+o(l)^(^)_ 

Assuming that the Fourier transform is implemented using the Cooley-Tukey FFT (see, e.g., [15]), we have 
f{n) — 0{logn). Solving the recursion, we find that 

□ 

Remark 4.14. Note that we could apply Theorem 13.51 directly, with the pre-processing map performing 
steps [1] through [3] of the algorithm, and the post-processing map performing steps [5] through [T] From the 
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discussion in this section wc sec that the operator norms of these maps subordinate to the Frobenius norm 
are bounded as 

||Pre„||„p < (iV!)i/2|i/|^/2, ||Post„|Uj, < \H\-^/\ 
while the error functions /pre and fpost are bounded by 

hre{n) < (7V!)i/2|ff|^/2/(|i/|^), < |i/r^/2/(|iJl^). 

FinaUy, the number of blocks t is ('^'^"^) « \H\'^ /Nl. This leads to a bound 

WCcomp - C||f < [\H^^'^liNl) + 2|ff |^(7V!)-i/2;(|^|^) + m\H\^/' f{\H\'')] e||A||F||B||F + 0{e% (21) 
which is somewhat weaker than ()20p . From (|2ip . we then obtain the recursion 

which gives 

Remark 4.15. The running time of an abelian STP algorithm can also be bounded in terms of the growth 
parameters of the abelian STP family. Specifically, the running time is O + . See [8] for details. 

Note that the sum of the two exponents, (a— 1)//? and (q; + 2)/2/3, is always bigger than 3, since a > 2/3+ 1: 

a-1 a + 2 3a 6/3 + 3 „ 
> > 3. 



13 2(3 2/3 - 2/3 



4.4 Analysis of examples 



The abelian STP algorithm in our running example has growth parameters a — 12, /3 — log2(15), hence its 
running time is 



and the error bound is 

Note that this error bound (in the Frobenius norm) implies a bound of 0{n^'^'^) in the entrywise maximum 
norm. This compares favorably with the error bound for Strassen's algorithm, which is 0{n^'^^) in the 
entrywise maximum norm, while nearly matching the exponent in the running time of Strassen's algorithm, 
which runs in time O (?T.'°8^2('i')) = Q (^7j2-8i^ Many other examples of abelian STP algorithms are listed in [7]. 
The algorithms with running time 0{n'^'^^) in Propositions 3.8 and 4.5 of [7| are both based on an explicit 
construction of an abelian STP family with growth parameters a = 31og4(6),/3 = log4(5), hence the error 
bound for these algorithms is 0{n^'^^). The algorithm with running time 0{n'^'^^) described in Theorems 
3.3 and 6.6 of [7] is based on an abehan STP family with growth parameters a = 3 logg 75(10), /3 — logg 75(8), 
hence the error bound for this algorithm is 0{n'^^'^^). 



5 Stability of linear algebra algorithms based on matrix multipli- 
cation 

It is natural to ask what other linear algebra operations can be done stably and quickly by depending on 
the stability of fast matrix multiplication described here. Indeed, "block" algorithms relying on matrix 
multiplication are used in practice for many linear algebra operations [2 [3], and have been shown to be 
stable assuming only the error bound ^ |T2]. In a companion paper fTV, we show that while stable these 
earlier block algorithms are not asymptotically as fast as matrix multiplication. However, [llj also shows 
there are variants of these block algorithms for operations like QR decomposition, linear equation solving 
and determinant computation that are both stable and as fast as matrix multiplication. 
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